MISC

# regarding 'readthedown' theme
# https://cran.r-project.org/web/packages/rmdformats/vignettes/introduction.html

Introduction

Check out this Kaggle dataset.

This data was gathered at two solar power plants in India over a 34-day period. There are two pairs of files - each pair has one power generation dataset and one sensor readings dataset. The power generation data are gathered at the inverter level - each inverter has multiple strings of solar panels attached to it. The sensor data are gathered at the plant level - a single array of sensors optimally placed at the plant.

There are a few areas of concern at the solar power plant -

  1. Can we predict the power generation for the next couple of days? - this allows for better grid management
  2. Can we identify the need for panel cleaning/maintenance?
  3. Can we identify faulty or sub-optimally performing equipment?

Ideas

  1. arima / stl modeling
  2. anomaly detection, boxplots
  3. related to 2?
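Idea 1 (ARIMA/STL) can be prototyped before touching the plant data; a minimal sketch using base R's `stl()` on the built-in `co2` series (a stand-in series, not the plant data):

```r
# STL splits a seasonal series into seasonal + trend + remainder components;
# co2 is a built-in monthly time series used here purely as a stand-in
fit <- stl(co2, s.window = "periodic")
colnames(fit$time.series)  # "seasonal" "trend" "remainder"
```

For the 15-minute plant readings, the series would first need to be aggregated into a regular `ts` (e.g. frequency 96 for daily seasonality) before `stl()` applies.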

DATA DICTIONARY for Power Generation data sets

  1. AC_POWER : Amount of AC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  2. DC_POWER : Amount of DC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  3. DAILY_YIELD : Daily yield is a cumulative sum of power generated on that day, till that point in time.
  4. DATE_TIME : Date and time for each observation. Observations recorded at 15 minute intervals.
  5. PLANT_ID : Plant ID - this will be common for the entire file.
  6. SOURCE_KEY : Source key in this file stands for the inverter id.
  7. TOTAL_YIELD : This is the total yield for the inverter till that point in time.
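Since DAILY_YIELD is described as a within-day cumulative sum, it should never decrease between consecutive readings of the same day. A base-R sketch of that sanity check on toy values (column names taken from the dictionary above, data invented):

```r
# one toy inverter-day at 15-minute (900 s) intervals
toy <- data.frame(
  date_time   = as.POSIXct("2020-05-15 06:00", tz = "Asia/Kolkata") + (0:3) * 900,
  daily_yield = c(0, 120, 310, 560)
)
all(diff(toy$daily_yield) >= 0)  # TRUE for a well-behaved day
```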

Power Generation Inverter data set: Plant 1

Get and Clean Data

p1.gd = read_csv('Plant_1_Generation_Data.csv') %>%
  slice_sample(prop = 0.10) %>% #!!<NOTE> temp: working with a 10% sample of the dataset for speed
  clean_names() %>%  #lowercase snake_case names
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.factor), where(is.character), where(is.numeric)) #then sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p1.gd = p1.gd %>% mutate(
  date_time = as.POSIXct(strptime(date_time, "%d-%m-%Y %H:%M"), tz = 'Asia/Kolkata'),
  source_key = factor(source_key)
) %>% rename(inverter = source_key)

p1.gd$plant_id = NULL
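The day-first timestamp format used above can be verified on a literal value (passing tz to strptime directly keeps the check machine-independent):

```r
# "%d-%m-%Y" reads the day first, so 15-05-2020 is 15 May 2020
x <- strptime("15-05-2020 06:30", "%d-%m-%Y %H:%M", tz = "Asia/Kolkata")
format(x, "%Y-%m-%d %H:%M")  # "2020-05-15 06:30"
```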

glimpse structure and sample rows

p1.gd %>% slice_sample(n = 10) %>% DT::datatable()
p1.gd %>% glimpse()
## Rows: 6,877
## Columns: 6
## $ date_time   <dttm> 2020-05-18 01:15:00, 2020-05-18 02:00:00, 2020-05-31 1...
## $ inverter    <fct> YxYtjZvoooNbGkE, zBIq5rxdHJRwDNY, sjndEbLyjtCKgGv, sjnd...
## $ ac_power    <dbl> 0.0000, 0.0000, 542.2500, 934.5375, 0.0000, 0.0000, 178...
## $ daily_yield <dbl> 0.0000, 0.0000, 2367.2500, 1616.5000, 0.0000, 0.0000, 8...
## $ dc_power    <dbl> 0.000, 0.000, 5531.125, 9566.750, 0.000, 0.000, 1819.00...
## $ total_yield <dbl> 7200229, 6359657, 7138622, 7067503, 7163262, 7281035, 7...

check missing values

p1.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

count each factor's unique levels

sapply(p1.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p1.gd %>% select(where(is.factor)), unique) %>% as.data.frame()
##           inverter
## 1  YxYtjZvoooNbGkE
## 2  zBIq5rxdHJRwDNY
## 3  sjndEbLyjtCKgGv
## 4  z9Y9gH1T5YWrNuG
## 5  uHbuxQJl8lW7ozc
## 6  VHMLBKoKgIrUVDU
## 7  wCURE6d3bPkepu2
## 8  ZoEaEvLYb1n2sOq
## 9  iCRJl6heRkivqQ3
## 10 1IF53ai7Xc0U56Y
## 11 7JYdWkrLSPkdwr4
## 12 3PZuoBAID5Wc2HD
## 13 WRmjgnKYAwPKWDb
## 14 ih0vzX44oOqAx2f
## 15 pkci93gMrogZuBj
## 16 bvBOhCH3iADSZry
## 17 ZnxXDlPa8U1GXgE
## 18 adLQvlD726eNBSB
## 19 zVJPv84UY57bAof
## 20 McdE0feGgRqW7Ca
## 21 1BY6WEcLGh8j5v7
## 22 rGa61gmuvPhdLxV

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p1.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 
(bar chart omitted: Source Key Counts - observation counts per inverter)

Sanity Check - PASS - counts are roughly uniform across inverters
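The eyeball check can be firmed up with a chi-squared goodness-of-fit test against a uniform distribution; a base-R sketch on hypothetical counts (in the real data, the counts would come from `count(inverter)`):

```r
# hypothetical per-inverter observation counts, close to uniform
n <- c(310, 298, 305, 312, 301, 295, 308, 299)
chisq.test(n)$p.value  # large p-value: no evidence against uniformity
```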

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p1.gd %>% select(where(is.numeric)), by = 'daily_yield')

viz: numeric univariate distributions

names.numeric = p1.gd %>% select(where(is.numeric)) %>% names

p1.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3]
  )

The 0s for ac/dc power and daily_yield largely reflect nighttime intervals; 0s during daylight hours more likely indicate the inverter was powered off for a 'rest day' or maintenance
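One way to separate expected nighttime zeros from suspicious ones is to flag zero readings that fall inside daylight hours. A base-R sketch on toy rows; the 6:00-18:00 daylight window is an assumption, not derived from the data:

```r
toy <- data.frame(
  date_time = as.POSIXct(c("2020-05-15 02:00", "2020-05-15 11:30", "2020-05-15 12:00"),
                         tz = "Asia/Kolkata"),
  ac_power  = c(0, 0, 540)
)
hr <- as.integer(format(toy$date_time, "%H"))
suspicious <- toy$ac_power == 0 & hr >= 6 & hr <= 18
suspicious  # FALSE TRUE FALSE - only the mid-day zero is flagged
```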

viz: numeric univariate distributions

DataExplorer::plot_density(p1.gd %>% select(where(is.numeric)))

viz: distributions by ‘inverter’ factor

p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
(boxplots omitted: Distribution of Daily Yield by Inverter)
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
(boxplots omitted: Distribution of AC Power by Inverter)
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')
(boxplots omitted: Distribution of DC Power by Inverter)

viz: 'Rest/Maintenance Days?' Check

p1.gd %>% arrange(date_time) %>% group_nest(inverter) %>% unnest(cols = data)
## # A tibble: 6,877 x 6
##    inverter        date_time           ac_power daily_yield dc_power total_yield
##    <fct>           <dttm>                 <dbl>       <dbl>    <dbl>       <dbl>
##  1 1BY6WEcLGh8j5v7 2020-05-15 01:15:00       0           0        0     6259559 
##  2 1BY6WEcLGh8j5v7 2020-05-15 04:15:00       0           0        0     6259559 
##  3 1BY6WEcLGh8j5v7 2020-05-15 08:45:00     418.        556.    4257.    6260115.
##  4 1BY6WEcLGh8j5v7 2020-05-15 09:00:00     559.        689.    5707.    6260248.
##  5 1BY6WEcLGh8j5v7 2020-05-15 10:15:00     650.       1308.    6637.    6260867.
##  6 1BY6WEcLGh8j5v7 2020-05-15 14:30:00     532.       4398.    5430.    6263957.
##  7 1BY6WEcLGh8j5v7 2020-05-15 19:30:00       0        5754        0     6265313 
##  8 1BY6WEcLGh8j5v7 2020-05-16 02:00:00       0           0        0     6265313 
##  9 1BY6WEcLGh8j5v7 2020-05-16 03:30:00       0           0        0     6265313 
## 10 1BY6WEcLGh8j5v7 2020-05-16 11:30:00     718.       2639     7335.    6267952 
## # ... with 6,867 more rows
p1.gd.plots = p1.gd %>% arrange(date_time) %>% group_nest(inverter) %>% mutate(
  plots = map2(
    .x = data,
    .y = inverter,
    ~ggplot(data = .x, aes(date_time, daily_yield, color = daily_yield == 0, group = 1)) +
      geom_line(size = 1.2) + scale_color_manual(values = c('black','red')) + ggtitle(paste0('Inverter: ', .y))
  ))

p1.gd.plots$plots
## (list of 22 per-inverter daily_yield time series plots, zero-yield points in red; omitted)

correlations: viz

p1.gd %>% dlookr::plot_correlate()

since DC power is converted to AC power by the inverter in an essentially linear way (much like Fahrenheit and Celsius), the two are almost perfectly correlated
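That near-perfect correlation can be sanity-checked by computing the AC/DC ratio per interval. Using the two complete reading pairs visible in the glimpse output above, the ratio is about 0.098 - which looks less like a 9.8% conversion efficiency and more like a units or scaling quirk in the raw DC column worth investigating:

```r
# ac/dc pairs taken from the glimpse output above
dc <- c(5531.125, 9566.750)
ac <- c(542.2500, 934.5375)
round(median(ac / dc), 3)  # 0.098
```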

EDA: Time Series Viz

Anomaly Plot

library(scales)
library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p1.gd.anomalize = p1.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

ggplotly(
  p1.gd.anomalize %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily yield')
  ) %>% layout(showlegend = FALSE)
(faceted anomaly plots omitted: daily yield by inverter, May 18 - Jun 15, detected anomalies highlighted)
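For intuition, anomalize's 'iqr' method is essentially Tukey fences applied to the decomposition remainder; a standalone base-R approximation on a toy remainder series (a sketch of the idea, not the package's exact algorithm):

```r
set.seed(1)
remainder <- c(rnorm(50), 8)  # 50 well-behaved points plus one injected spike
q <- quantile(remainder, c(0.25, 0.75))
fence <- 1.5 * diff(q)
anomaly <- remainder < q[[1]] - fence | remainder > q[[2]] + fence
which(anomaly)  # positions flagged as anomalous (includes the injected spike at 51)
```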

Power Generation Inverter data set: Plant 2

Get and Clean Data

p2.gd = read_csv('Plant_2_Generation_Data.csv') %>%
  slice_sample(prop = 0.10) %>% #!!<NOTE> temp: working with a 10% sample of the dataset for speed
  clean_names() %>%  #lowercase snake_case names
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.POSIXct), where(is.factor), where(is.character), where(is.numeric)) #then sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p2.gd = p2.gd %>% mutate(
  source_key = factor(source_key)
) %>% rename(inverter = source_key)

p2.gd$plant_id = NULL

glimpse structure and sample rows

p2.gd %>% slice_sample(n = 10) %>% DT::datatable()
p2.gd %>% glimpse()
## Rows: 6,769
## Columns: 6
## $ date_time   <dttm> 2020-05-20 21:30:00, 2020-06-05 22:30:00, 2020-06-02 2...
## $ inverter    <fct> Quc1TzYxW2pYoWX, oZ35aAeoifZaQzV, WcxssY2VbP4hApt, rrq4...
## $ ac_power    <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,...
## $ daily_yield <dbl> 2419.0000, 7708.0000, 3877.0000, 7271.0000, 0.0000, 0.0...
## $ dc_power    <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,...
## $ total_yield <dbl> 329530672, 1660126361, 181838353, 121118558, 593698089,...

check missing values

p2.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

count each factor's unique levels

sapply(p2.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p2.gd %>% select(where(is.factor)), unique) %>% as.data.frame()
##           inverter
## 1  Quc1TzYxW2pYoWX
## 2  oZ35aAeoifZaQzV
## 3  WcxssY2VbP4hApt
## 4  rrq4fwE8jgrTyWY
## 5  mqwcsP2rE7J0TFp
## 6  V94E5Ben1TlhnDV
## 7  LYwnQax7tkwH5Cb
## 8  PeE6FRyGXUgsRhN
## 9  xoJJ8DcxJEcupym
## 10 vOuJvMaM2sgwLmb
## 11 Mx2yZCDsyf6DPfv
## 12 Qf4GUc1pJu5T6c6
## 13 IQ2d7wF4YD8zU1Q
## 14 q49J1IKaHRwDQnt
## 15 9kRcWv60rDACzjR
## 16 oZZkBaNadn6DNKz
## 17 NgDl19wMapZy17u
## 18 4UPUqMRk7TRMgml
## 19 Et9kgGMDl729KT4
## 20 LlT2YUhhzqhg5Sw
## 21 81aHJ1q11NBPMrL
## 22 xMbIugepa2P7lBB

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p2.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 
(bar chart omitted: Source Key Counts - observation counts per inverter)

Sanity Check - PASS - counts are roughly uniform across inverters

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p2.gd %>% select(where(is.numeric)), by = 'daily_yield')

viz: numeric univariate distributions

names.numeric = p2.gd %>% select(where(is.numeric)) %>% names

p2.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3],
  names.numeric[4]
  )

There seem to be many outliers, especially 0s, in several features. Could these indicate faulty inverters? Plot by inverter to check.
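A simple numeric companion to the per-inverter boxplots: compare each inverter's median daily yield to the plant-wide median and flag large shortfalls. Base-R sketch on toy data; the 20% shortfall threshold is an arbitrary assumption:

```r
toy <- data.frame(
  inverter    = rep(c("A", "B", "C"), each = 4),
  daily_yield = c(6000, 6200, 5900, 6100,   # A: healthy
                  6050, 5950, 6150, 6000,   # B: healthy
                  3000, 2900, 3100, 2800)   # C: under-performer
)
med  <- tapply(toy$daily_yield, toy$inverter, median)
flag <- med < 0.8 * median(toy$daily_yield)
names(flag)[flag]  # "C"
```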

viz: numeric univariate distributions

DataExplorer::plot_density(p2.gd %>% select(where(is.numeric)))

viz: distributions by ‘inverter’ factor

p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = total_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~total_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Total Yield by Inverter')
(boxplots omitted: Distribution of Total Yield by Inverter)
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
(boxplots omitted: Distribution of Daily Yield by Inverter)
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
(boxplots omitted: Distribution of AC Power by Inverter)
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')
(boxplots omitted: Distribution of DC Power by Inverter)

viz: 'Rest/Maintenance Days?' Check

p2.gd %>% arrange(date_time) %>% group_nest(inverter) %>% unnest(cols = data)
## # A tibble: 6,769 x 6
##    inverter        date_time           ac_power daily_yield dc_power total_yield
##    <fct>           <dttm>                 <dbl>       <dbl>    <dbl>       <dbl>
##  1 4UPUqMRk7TRMgml 2020-05-15 01:00:00      0       7540.        0      2429011 
##  2 4UPUqMRk7TRMgml 2020-05-15 01:15:00      0          0         0      2429011 
##  3 4UPUqMRk7TRMgml 2020-05-15 02:00:00      0          0         0      2429011 
##  4 4UPUqMRk7TRMgml 2020-05-15 02:30:00      0          0         0      2429011 
##  5 4UPUqMRk7TRMgml 2020-05-15 06:15:00     26.5        6.13     27.4    2429017.
##  6 4UPUqMRk7TRMgml 2020-05-15 10:15:00      0       2171         0      2431182 
##  7 4UPUqMRk7TRMgml 2020-05-15 10:30:00      0       2171         0      2431182 
##  8 4UPUqMRk7TRMgml 2020-05-15 12:00:00      0       2171         0      2431182 
##  9 4UPUqMRk7TRMgml 2020-05-15 12:30:00      0       2171         0      2431182 
## 10 4UPUqMRk7TRMgml 2020-05-15 14:00:00      0       2750         0      2431761 
## # ... with 6,759 more rows
p2.gd.plots = p2.gd %>% arrange(date_time) %>% group_nest(inverter) %>% mutate(
  plots = map2(
    .x = data,
    .y = inverter,
    ~ggplot(data = .x, aes(date_time, daily_yield, color = daily_yield == 0, group = 1)) +
      geom_line(size = 1.2) + scale_color_manual(values = c('black','red')) + ggtitle(paste0('Inverter: ', .y))
  ))

p2.gd.plots$plots
## (list of 22 per-inverter daily_yield time series plots, zero-yield points in red; omitted)

correlations: viz

p2.gd %>% dlookr::plot_correlate()

since DC power is converted to AC power by the inverter in an essentially linear way (much like Fahrenheit and Celsius), the two are almost perfectly correlated

EDA: Time Series Viz

Anomaly Plot

library(scales)
library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p2.gd.anomalize = p2.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

ggplotly(
  p2.gd.anomalize %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily yield')
  ) %>% layout(showlegend = FALSE)
(faceted anomaly plots omitted: daily yield by inverter, May 18 - Jun 15, detected anomalies highlighted)

DATA DICTIONARY for Sensor Reading data sets

  1. IRRADIATION: Amount of irradiation for the 15 minute interval.
  2. DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
  3. PLANT_ID: Plant ID - this will be common for the entire file.
  4. SOURCE_KEY: Stands for the sensor panel id. This will be common for the entire file because there’s only one sensor panel for the plant.
  5. MODULE_TEMPERATURE: There’s a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.
  6. AMBIENT_TEMPERATURE: This is the ambient temperature at the plant.
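Per the dictionary, module temperature should track ambient temperature plus heating from irradiation. A toy linear-model sketch of that relationship in base R (the 25-degree full-sun gain is an invented assumption, not a measured value):

```r
set.seed(42)
irradiation <- runif(100, 0, 1.2)
ambient     <- rnorm(100, mean = 27, sd = 2)
module      <- ambient + 25 * irradiation + rnorm(100, sd = 1)  # assumed gain of 25 at full sun
fit <- lm(module ~ ambient + irradiation)
coef(fit)[["irradiation"]]  # recovers roughly 25
```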

Sensor Data Set: Plant 1

Get and Clean Data

p1.ws = read_csv('Plant_1_Weather_Sensor_Data.csv') %>%
  slice_sample(prop = 0.10) %>% #!!<NOTE> temp: working with a 10% sample of the dataset for speed
  clean_names() %>%  #lowercase snake_case names
  mutate(across(where(is.character),factor)) %>% 
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(date_time, where(is.factor), where(is.numeric)) %>% #sort cols by data type
  arrange(date_time)

p1.ws$plant_id = NULL
p1.ws$source_key = NULL

sample rows

p1.ws %>% slice_sample(n = 10) %>% DT::datatable()

glimpse structure

p1.ws %>% glimpse()
## Rows: 318
## Columns: 4
## $ date_time           <dttm> 2020-05-15 00:15:00, 2020-05-15 04:30:00, 2020...
## $ ambient_temperature <dbl> 25.08459, 24.06262, 24.17711, 25.95908, 27.9883...
## $ irradiation         <dbl> 0.00000000, 0.00000000, 0.00000000, 0.34570765,...
## $ module_temperature  <dbl> 22.76167, 21.85252, 22.55191, 35.52871, 46.6177...

check missing values

p1.ws %>% miss_var_summary
## # A tibble: 4 x 3
##   variable            n_miss pct_miss
##   <chr>                <int>    <dbl>
## 1 date_time                0        0
## 2 ambient_temperature      0        0
## 3 irradiation              0        0
## 4 module_temperature       0        0

EDA: Numeric Vars

viz: numeric univariate distributions

p1.ws %>% select(where(is.numeric)) %>% dlookr::plot_normality()

viz: numeric univariate outlier check

p1.ws %>% select(where(is.numeric)) %>% dlookr::diagnose_outlier()
## # A tibble: 3 x 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 ambient_temp~            0           0           NaN       25.6         25.6  
## 2 irradiation              4           1.26          1.10     0.223        0.212
## 3 module_tempe~            0           0           NaN       31.0         31.0
p1.ws %>% select(where(is.numeric)) %>% DataExplorer::plot_density()

correlations: viz

p1.ws %>% dlookr::plot_correlate()

p1.ws %>% select(where(is.numeric)) %>% GGally::ggcorr(low = '#990000', mid = '#E0E0E0', high = '#009900', label = TRUE)

viz: Time Series w/IQR percentiles

ggplotly(p1.ws %>% ggplot(aes(date_time, ambient_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$ambient_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'ambient_temperature')
(time series plot omitted: ambient_temperature with 25th/50th/75th percentile reference lines)
ggplotly(p1.ws %>% ggplot(aes(date_time, irradiation)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$irradiation, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'irradiation')
(time series plot omitted: irradiation with 25th/50th/75th percentile reference lines)
ggplotly(p1.ws %>% ggplot(aes(date_time, module_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p1.ws$module_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'module_temperature')
(time series plot omitted: module_temperature with 25th/50th/75th percentile reference lines)
p1.ws %>% pivot_longer(ambient_temperature:module_temperature) %>% ggplot(aes(date_time, value, color = name)) + geom_line(size = 1.1) + facet_wrap(~name, ncol = 1, scales = 'free_y') + theme(legend.position = 'none')

Anomaly Plot

library(scales)
library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(ambient_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.15, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'ambient_temperature')
  ) %>% layout(showlegend = FALSE, title = 'ambient_temperature')
(anomaly plot omitted: ambient_temperature, detected anomalies highlighted)
ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(irradiation, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'irradiation')
  ) %>% layout(showlegend = FALSE, title = 'irradiation')
(anomaly plot omitted: irradiation, detected anomalies highlighted)
ggplotly(
  p1.ws %>% arrange(date_time) %>% 
  time_decompose(module_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'module_temperature')
  ) %>% layout(showlegend = FALSE, title = 'module_temperature')
(anomaly plot omitted: module_temperature, detected anomalies highlighted)

Sensor Data Set: Plant 2

Get and Clean Data

p2.ws = read_csv('Plant_2_Weather_Sensor_Data.csv') %>%
  slice_sample(prop = 0.10) %>% #!!<NOTE> temp: working with a 10% sample of the dataset for speed
  clean_names() %>%  #lowercase snake_case names
  mutate(across(where(is.character),factor)) %>% 
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(date_time, where(is.factor), where(is.numeric)) %>% #sort cols by data type
  arrange(date_time)

p2.ws$plant_id = NULL
p2.ws$source_key = NULL

sample rows

p2.ws %>% slice_sample(n = 10) %>% DT::datatable()

glimpse structure

p2.ws %>% glimpse()
## Rows: 325
## Columns: 4
## $ date_time           <dttm> 2020-05-15 00:15:00, 2020-05-15 01:15:00, 2020...
## $ ambient_temperature <dbl> 26.88081, 26.51274, 25.48205, 29.31902, 34.2518...
## $ irradiation         <dbl> 0.000000000, 0.000000000, 0.000000000, 0.625399...
## $ module_temperature  <dbl> 24.42187, 25.31797, 24.79669, 40.73804, 54.8943...

check missing values

p2.ws %>% miss_var_summary
## # A tibble: 4 x 3
##   variable            n_miss pct_miss
##   <chr>                <int>    <dbl>
## 1 date_time                0        0
## 2 ambient_temperature      0        0
## 3 irradiation              0        0
## 4 module_temperature       0        0

EDA: Numeric Vars

viz: numeric univariate distributions

p2.ws %>% select(where(is.numeric)) %>% dlookr::plot_normality()

viz: numeric univariate outlier check

p2.ws %>% select(where(is.numeric)) %>% dlookr::diagnose_outlier()
## # A tibble: 3 x 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 ambient_temp~            0              0           NaN    28.4         28.4  
## 2 irradiation              0              0           NaN     0.249        0.249
## 3 module_tempe~            0              0           NaN    33.4         33.4
p2.ws %>% select(where(is.numeric)) %>% DataExplorer::plot_density()

correlations: viz

p2.ws %>% dlookr::plot_correlate()

p2.ws %>% select(where(is.numeric)) %>% GGally::ggcorr(low = '#990000', mid = '#E0E0E0', high = '#009900', label = TRUE)

viz: Time Series w/IQR percentiles

ggplotly(p2.ws %>% ggplot(aes(date_time, ambient_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$ambient_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'ambient_temperature')
ggplotly(p2.ws %>% ggplot(aes(date_time, irradiation)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$irradiation, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'irradiation')
ggplotly(p2.ws %>% ggplot(aes(date_time, module_temperature)) +
  geom_line(size = 1.2) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.25), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.50), lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
  geom_hline(yintercept = quantile(p2.ws$module_temperature, 0.75), lty = 'dashed', alpha = 0.7, color = 'red', size = 1.1)
) %>% layout(title = 'module_temperature')
p2.ws %>% pivot_longer(ambient_temperature:module_temperature) %>% ggplot(aes(date_time, value, color = name)) + geom_line(size = 1.1) + facet_wrap(~name, ncol = 1, scales = 'free_y') + theme(legend.position = 'none')
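The three per-variable quantile plots above repeat the same pattern. A small helper along these lines (function name hypothetical, assuming `p2.ws` as built above) could generate any of them:

```r
library(ggplot2)

# Hypothetical helper: line plot of one numeric series with dashed
# reference lines at its 25th/50th/75th percentiles.
plot_with_iqr <- function(data, var) {
  q <- quantile(data[[var]], c(0.25, 0.50, 0.75))
  ggplot(data, aes(date_time, .data[[var]])) +
    geom_line(size = 1.2) +
    geom_hline(yintercept = q[1], lty = 'dashed', alpha = 0.7, color = 'red',  size = 1.1) +
    geom_hline(yintercept = q[2], lty = 'dashed', alpha = 0.7, color = 'blue', size = 1.1) +
    geom_hline(yintercept = q[3], lty = 'dashed', alpha = 0.7, color = 'red',  size = 1.1) +
    labs(title = var)
}

# e.g. ggplotly(plot_with_iqr(p2.ws, 'ambient_temperature'))
```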

Anomaly Plot

library(scales)
library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values widen the band (more conservative, fewer points flagged); higher values narrow it, so more observations are classified as anomalies.
# max_anoms: The maximum percentage of observations permitted to be flagged as anomalies.
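To see the effect of `alpha`, here is a sketch on synthetic data (not the plant series): a daily sine cycle plus noise with two injected spikes. A lower `alpha` should flag no more points than a higher one.

```r
library(tibble)
library(dplyr)
library(anomalize)

set.seed(42)
n <- 24 * 14  # two weeks of hourly observations
toy <- tibble(
  date_time = seq(as.POSIXct('2020-05-15', tz = 'UTC'), by = '1 hour', length.out = n),
  value     = sin(seq_len(n) * 2 * pi / 24) + rnorm(n, sd = 0.1)
)
toy$value[c(50, 200)] <- toy$value[c(50, 200)] + 3  # inject two spikes

# count anomalies flagged at a given alpha
count_anoms <- function(a) {
  toy %>%
    time_decompose(value, method = 'twitter') %>%
    anomalize(remainder, alpha = a, method = 'gesd') %>%
    filter(anomaly == 'Yes') %>%
    nrow()
}

count_anoms(0.05)  # conservative band
count_anoms(0.20)  # narrower band, flags at least as many points
```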

ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(ambient_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.15, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'ambient_temperature')
  ) %>% layout(showlegend = FALSE, title = 'ambient_temperature')
ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(irradiation, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'irradiation')
  ) %>% layout(showlegend = FALSE, title = 'irradiation')
ggplotly(
  p2.ws %>% arrange(date_time) %>% 
  time_decompose(module_temperature, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose() %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'module_temperature')
  ) %>% layout(showlegend = FALSE, title = 'module_temperature')